03_C_TSGraphs_Lagplots_Autocorrelation

library(fpp3)

References

This is not original material but a compilation of the following sources with some additional examples and clarifications introduced by the professor:

  1. Hyndman, R.J., & Athanasopoulos, G., 2021. Forecasting: principles and practice. 3rd edition.
  2. Additional material provided by the book authors

Lagged variables and time-plots

Given a time series we can compute its lags by shifting it in time a given number of steps. The lagged variable simply contains past information of the variable, which has been shifted as many steps as indicated by the number of the lag.

In more technical terms: given a variable \(x_t\), its lags correspond to delayed versions of the variable, with the number of the lag indicating the time delay:

  • The first lag of a variable \(x_t\) is written as \(x_{t-1}\). At any point in time \(t\), the value of \(x_{t-1}\) is the value of \(x_t\) one timestep ago.
  • The second lag of a variable \(x_t\) is written as \(x_{t-2}\). At any point in time \(t\), the value of \(x_{t-2}\) is the value of \(x_t\) two timesteps ago.
  • The n-th lag of a variable \(x_t\) is written as \(x_{t-n}\). At any point in time \(t\), the value of \(x_{t-n}\) is the value of \(x_t\) \(n\) timesteps ago.
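This shifting is easy to sketch in a couple of lines of base R (the values below are made up purely for illustration):

```r
# Minimal base-R illustration of lagging: shift a vector k steps
# and pad the gap at the start with NAs (values are hypothetical)
x <- c(10, 20, 30, 40, 50)
k <- 2
x_lag2 <- c(rep(NA, k), head(x, -k))
x_lag2  # NA NA 10 20 30
```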

Lagged variables are relevant in time series for multiple reasons. Primarily because:

  1. By studying the relationship between the lags {\(x_{t-1}\), \(x_{t-2}\), \(\cdots\), \(x_{t-n}\)} and \(x_t\), we can understand if the variable \(x_t\) is related to its values some timesteps ago.
  2. By studying the relationship between the lags {\(x_{t-1}\), \(x_{t-2}\), \(\cdots\), \(x_{t-n}\)} and a second variable \(y_t\), we can understand if the variable \(y_t\) is related to what happened in the past in the variable x.
    • A typical business case: current sales might be related to investment in marketing some timesteps ago.

As with most things in life, this is better understood with an example:

recent_beer <- aus_production %>%
  filter(year(Quarter) >= 2000) %>%
  select(Quarter, Beer)

recent_beer
# A tsibble: 42 x 2 [1Q]
   Quarter  Beer
     <qtr> <dbl>
 1 2000 Q1   421
 2 2000 Q2   402
 3 2000 Q3   414
 4 2000 Q4   500
 5 2001 Q1   451
 6 2001 Q2   380
 7 2001 Q3   416
 8 2001 Q4   492
 9 2002 Q1   428
10 2002 Q2   408
# ℹ 32 more rows

With this code, we generate the lagged values of the time series corresponding to the variable Beer:

for (i in seq(1, 4)) {
  lag_name <- paste0("Beer_lag", i)                       # e.g. "Beer_lag1"
  recent_beer[[lag_name]] <- lag(recent_beer[["Beer"]], i)
}

recent_beer
# A tsibble: 42 x 6 [1Q]
   Quarter  Beer Beer_lag1 Beer_lag2 Beer_lag3 Beer_lag4
     <qtr> <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 2000 Q1   421        NA        NA        NA        NA
 2 2000 Q2   402       421        NA        NA        NA
 3 2000 Q3   414       402       421        NA        NA
 4 2000 Q4   500       414       402       421        NA
 5 2001 Q1   451       500       414       402       421
 6 2001 Q2   380       451       500       414       402
 7 2001 Q3   416       380       451       500       414
 8 2001 Q4   492       416       380       451       500
 9 2002 Q1   428       492       416       380       451
10 2002 Q2   408       428       492       416       380
# ℹ 32 more rows

If we call beer \(y_t\) (the time series we are studying), the formal notation for the variables above would be:

  • \(y_t \rightarrow\) Beer production, the variable we are studying
  • \(y_{t-1} \rightarrow\) Beer_lag1 - first lag of the variable beer
  • \(y_{t-2} \rightarrow\) Beer_lag2 - second lag of the variable beer
  • \(y_{t-3} \rightarrow\) Beer_lag3 - third lag of the variable beer
  • \(y_{t-4} \rightarrow\) Beer_lag4 - fourth lag of the variable beer

It is important to remark that:

  • Because there is no information to fill the void generated when shifting the variable, these voids are filled with NAs.
    • Note that the lagged variables are time series, but shorter than the original variable due to these NAs.
  • The rest of the values of the time series remain the same.

This results in the following structure of NAs as the lag number increases:

Let us now create a time-plot of these lags. If we focus, for example, on Beer (\(y_t\)) and Beer_lag4 (\(y_{t-4}\)), we observe that each value of the variable Beer is placed next to its value 4 time steps ago. For example, the value of Beer at 2001 Q1 is next to its value at 2000 Q1 (exactly four time steps earlier).

recent_beer %>% select(Beer, Beer_lag4)
# A tsibble: 42 x 3 [1Q]
    Beer Beer_lag4 Quarter
   <dbl>     <dbl>   <qtr>
 1   421        NA 2000 Q1
 2   402        NA 2000 Q2
 3   414        NA 2000 Q3
 4   500        NA 2000 Q4
 5   451       421 2001 Q1
 6   380       402 2001 Q2
 7   416       414 2001 Q3
 8   492       500 2001 Q4
 9   428       451 2002 Q1
10   408       380 2002 Q2
# ℹ 32 more rows

The code below creates a time-plot of both variables (change n_lag if you wish to depict another lag):

n_lag <- 4
lag_name <- paste0("Beer_lag", n_lag)

recent_beer %>% 
  autoplot() +
  scale_x_yearquarter(breaks = "1 year",
                      minor_breaks = "1 year") +
  geom_line(aes(x = Quarter, y = .data[[lag_name]]), # .data pronoun: tidy evaluation with a column name stored as a string
            color = "red",
            linetype = "dashed")
Plot variable not specified, automatically selected `.vars = Beer`
Warning: Removed 4 rows containing missing values (`geom_line()`).

We can see that, for the 4-th lag, the lagged variable and the original variable are perfectly in phase. This is because this series has a seasonal period of 1 year (4 quarters) and simple seasonality (meaning only this yearly seasonality is present).

Lagged-variables and scatterplots. Autocorrelation.

Lagged variables enable us to study the extent to which the value of the series at time t (the original series) depends upon the values of the series at past time steps.

Let us compute a scatterplot depicting the relation between the fourth lag of the variable Beer and the original variable. That is, we are going to forget about time and simply pair up the values of the series Beer and its fourth lag Beer_lag4. These are the values highlighted in yellow in the tsibble below (note that the NAs have of course been excluded):

The code to attain that is:

recent_beer %>%
  gg_lag(y = Beer, geom = "point", lags = 4)

We could summarize the degree of linear relationship between these two variables using, for example, Pearson's correlation coefficient. However, we will see that there is a better-suited metric in the case of time series.
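As a quick base-R sketch, Pearson's coefficient between Beer and the Beer_lag4 column created earlier can be computed on the overlapping pairs only:

```r
# Pearson correlation between Beer and its fourth lag.
# use = "complete.obs" restricts the computation to the overlapping
# (non-NA) pairs -- equivalent to correlating Beer[5:42] with Beer[1:38]
cor(recent_beer$Beer, recent_beer$Beer_lag4, use = "complete.obs")
```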

Let us compute these scatterplots for a number of lags. This is what is called a lagplot of the time series:

recent_beer %>%
  gg_lag(y = Beer, geom = "point", lags = 1:12)

In the figure above, you see that the relationship between the variable and the lagged variables at multiples of the seasonal period (m = 4, 8, 12…) is much stronger.

Autocorrelation vs. Pearson's correlation coefficient

Let us recall the formula for Pearson's correlation coefficient between two variables \(x\) and \(y\):

\[\begin{align*} r = \frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})} {\sqrt{\sum(x_{i}-\bar{x})^2\sum(y_{i}-\bar{y})^2}} \end{align*}\]

If we applied this formula to each of the lagplots in the previous section, the number of terms in the sums would decrease as the lag number increases. This is because the lagged time series has fewer points than the original series.

We can use this fact to create a new correlation coefficient that considers the fact that, the further away in the past a lag is, the smaller its influence over the original time series should be. This is what we call the autocorrelation coefficient between the time series \(y_t\) and its k-th lag \(y_{t-k}\):

\[\begin{align*} r_{k} = \frac{\sum\limits_{t=k+1}^T (y_{t}-\bar{y})(y_{t-k}-\bar{y})} {\sum\limits_{t=1}^T (y_{t}-\bar{y})^2} \end{align*}\]

In the above formula:

  • \(t = 1\) is the first point in the time series
  • \(t = T\) is the last point for which we have recorded data
  • The denominator remains the same for the correlation coefficient of every lag. The sum in the denominator extends over the totality of the original time series (from \(t=1\) to \(t=T\)).
  • The numerator has a decreasing number of terms as \(k\) (the lag-number) increases. The sum in the numerator extends from \(k+1\) to \(T\).
  • Example: if \(k = 1\) (first lag), the first value of the lagged series (\(t=1\)) is an NA. The sum therefore extends from \(t=k+1=2\) until the end.

This inherently decreases the autocorrelation coefficient of lags that are further away in the past, because the denominator remains constant while the number of terms in the numerator decreases as \(k\) increases.

IMPORTANT: this does not mean that autocorrelation coefficients in the past cannot be high. It means that the autocorrelation coefficient has this built-in feature that scales the coefficient depending on the point in time it is referring to.
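The formula is easy to verify by hand. A minimal base-R sketch (using a synthetic series, since no package is needed for the arithmetic):

```r
# Compute the k-th autocorrelation coefficient by hand and check it
# against base R's acf(), which uses exactly this normalisation
set.seed(1)
y <- rnorm(50)       # synthetic series, for illustration only
k <- 3
y_bar <- mean(y)
T_len <- length(y)

# Numerator: sum over the T - k overlapping pairs
# Denominator: sum over the full series (constant for every k)
r_k <- sum((y[(k + 1):T_len] - y_bar) * (y[1:(T_len - k)] - y_bar)) /
  sum((y - y_bar)^2)

# acf() returns lag 0 first, so the k-th lag sits at index k + 1
r_k_acf <- acf(y, lag.max = k, plot = FALSE)$acf[k + 1]
all.equal(r_k, r_k_acf)  # TRUE
```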

Computing the autocorrelation coefficients in R

The autocorrelation coefficients may be easily computed using the function ACF()

recent_beer %>%
  ACF()
Response variable not specified, automatically selected `var = Beer`
# A tsibble: 16 x 2 [1Q]
        lag      acf
   <cf_lag>    <dbl>
 1       1Q -0.0530 
 2       2Q -0.758  
 3       3Q -0.0262 
 4       4Q  0.802  
 5       5Q -0.0775 
 6       6Q -0.657  
 7       7Q  0.00119
 8       8Q  0.707  
 9       9Q -0.0888 
10      10Q -0.588  
11      11Q  0.0241 
12      12Q  0.632  
13      13Q -0.0660 
14      14Q -0.505  
15      15Q -0.00891
16      16Q  0.498  

ACF Plot (a.k.a. Correlogram)

If we now compute the autocorrelation coefficient corresponding to each lag, we can create a graph with the following axes:

  • x-axis: the lag number.
  • y-axis: the value of the corresponding autocorrelation coefficient.

This is what we call the correlogram (or ACF plot). It can be easily depicted as follows:

recent_beer %>%
  ACF() %>%
  autoplot()
Response variable not specified, automatically selected `var = Beer`

Number of lags to be depicted

With the argument lag_max we can specify the number of lags to be computed and subsequently depicted in the correlogram. This is important because, for seasonality to appear in the correlogram, a sufficient number of lags needs to be computed. See the upcoming sections for further details.

recent_beer %>%
  ACF(lag_max = 24) %>%
  autoplot()
Response variable not specified, automatically selected `var = Beer`

Trend and seasonality in ACF plots

THIS SECTION IS REALLY IMPORTANT AND ALL STUDENTS MUST LEARN TO RECOGNIZE THESE PATTERNS AND THEIR SUBTLETIES. The assignments and exam will require this due to its importance.

When data have a trend:

  • The autocorrelation coefficients for small lags (note the emphasis on small lags) tend to be large and positive, because observations nearby in time are also nearby in value. The ACF of a trended time series therefore tends to have positive values that slowly decrease as the lag increases. Please focus only on the region circled in red and not on the region where the ACF becomes negative in this case: what identifies the trend is this slowly decreasing sequence of ACF values that are positive at the beginning.

  • Note that the pattern is the same for a negative trend:

When data are seasonal:

  • The autocorrelation will be larger for the seasonal lags (at multiples of the seasonal period) than for other lags.
    • NOTE: the autocorrelation at higher multiples of the seasonal period will be smaller than at lower multiples, because the former are further away in the past and the formula for the autocorrelation coefficient accounts for this fact, as we previously explained.

When data are both trended and seasonal

  • You see a combination of these effects. Depending on the relative strength of the trend and the seasonality, the pattern varies. Below is a description of some of the most common cases. These have been generated using this Shiny app:

Web-app - Time Series autocorrelation - basic patterns

  • Seasonal and trend data are of the same order (one does not dominate over the other):

Let us look at the example of the sales of Australian antidiabetic drugs:

  • Seasonal component weaker than trend component: waves due to the seasonal component can be seen on the ACF, but the ACF alone does not reveal the actual seasonal period (lag 12 is not higher than the rest). This does not mean that there is no seasonality, only that it is not visible on the ACF. With time series decomposition (next session), we will be able to extract the seasonal component.

  • Seasonal component much weaker than trend component: ACF appears to be that of a trend. However this does not mean that there is no seasonality, only that the ACF does not reflect it. With time series decomposition (next session), we will be able to extract the seasonal component.

Another relevant case to consider is when the trend component is much weaker than the seasonal component. Then the trend would barely be noticeable on the correlogram. Again, this does not mean that there is no trend, it just means that the correlogram does not reflect it because it is masked by the seasonal component.
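To experiment with these mixed patterns yourself, a small base-R sketch (synthetic monthly data, with made-up coefficients) reproduces the combined trend-plus-seasonality correlogram:

```r
# Synthetic monthly series: linear trend + yearly seasonality + noise.
# All coefficients here are made up purely for illustration.
set.seed(123)
t <- 1:120                                          # 10 years of monthly data
y <- 0.5 * t + 10 * sin(2 * pi * t / 12) + rnorm(120, sd = 3)

# Correlogram: slow decay from the trend, with waves peaking
# at multiples of the seasonal period (12, 24, 36)
acf(y, lag.max = 36)
```

Changing the relative size of the trend slope and the seasonal amplitude lets you reproduce each of the cases described above.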

White noise

Time series that show no autocorrelation are called white noise.

# Sets a seed to ensure repeatability of random functions
set.seed(30)

y <- tsibble(sample = 1:50, wn = rnorm(50), index = sample)

y %>% autoplot(wn) + labs(title = "White noise", y = "")

In particular, the above series samples from a Gaussian distribution, so this noise is called Gaussian white noise. Each value is generated by sampling independently from a normal distribution: \(\epsilon_t \sim \mathcal{N}(0, \sigma^2)\).

Let's have a look at the correlogram:

y %>%
  ACF(wn) %>%
  autoplot() + labs(title = "White noise")

For white noise series, autocorrelation is expected to be close to 0, but there is some random variation that will make it deviate from zero.

  • We expect 95% of the spikes in the ACF to lie within \(\pm 2/\sqrt{T}\), where \(T\) is the length of the time series.
  • Checking this is similar to performing a separate statistical test on each autocorrelation coefficient.
    • Therefore, we may have false positives.

If:

  • One or more large spikes are outside these bounds
  • Substantially more than 5% of the spikes are outside these bounds

Then the time series is probably not white noise.

Later on in the course we will see how to perform statistical tests that encompass multiple autocorrelation coefficients at once. For now, the above criteria should suffice.
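These criteria are easy to check programmatically. A base-R sketch (the threshold \(\pm 2/\sqrt{T}\) is the one given above):

```r
# Fraction of ACF spikes outside the +/- 2/sqrt(T) white-noise bounds
set.seed(30)
wn <- rnorm(50)                      # same white-noise setup as above
T_len <- length(wn)
bound <- 2 / sqrt(T_len)

r <- acf(wn, lag.max = 16, plot = FALSE)$acf[-1]  # drop lag 0 (always 1)
mean(abs(r) > bound)  # for white noise, expect roughly 5% or fewer outside
```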

Exercise 1 - ACF Plot patterns

With the concepts we have seen in this class, you should be able to tell which time series corresponds to each correlogram:

3 - D - series exhibits a clear trend and D is the only correlogram that matches 
this. We also observe spikes at lag 12. Since this is monthly data, this 
corresponds to yearly seasonality. It matches the data structure.

4 - C - the series exhibits cycles of approximately 10 years. These cycles seem 
pretty regular. The only correlogram showing this behavior is correlogram C. 
Remember these are cycles and not seasonality. We are dealing with yearly data.

2 - A - correlogram exhibits yearly seasonality (spikes at lag 12 for monthly data), 
matching the time series. No trend in the data.

1 - Signal from sensor. More erratic behavior and a bit of downwards trend. 
No seasonality and a bit of trend. The only correlogram matching this is B.

Exercise 2 - time series graphs

Use the graph functions autoplot() and ACF() to explore the features of the time series Total Private from the dataset us_employment (loaded with fpp3):

us_employment
# A tsibble: 143,412 x 4 [1M]
# Key:       Series_ID [148]
      Month Series_ID     Title         Employed
      <mth> <chr>         <chr>            <dbl>
 1 1939 Jan CEU0500000001 Total Private    25338
 2 1939 Feb CEU0500000001 Total Private    25447
 3 1939 Mar CEU0500000001 Total Private    25833
 4 1939 Apr CEU0500000001 Total Private    25801
 5 1939 May CEU0500000001 Total Private    26113
 6 1939 Jun CEU0500000001 Total Private    26485
 7 1939 Jul CEU0500000001 Total Private    26481
 8 1939 Aug CEU0500000001 Total Private    26848
 9 1939 Sep CEU0500000001 Total Private    27468
10 1939 Oct CEU0500000001 Total Private    27830
# ℹ 143,402 more rows
total_private <- us_employment %>%
  filter(Title == "Total Private")

total_private
# A tsibble: 969 x 4 [1M]
# Key:       Series_ID [1]
      Month Series_ID     Title         Employed
      <mth> <chr>         <chr>            <dbl>
 1 1939 Jan CEU0500000001 Total Private    25338
 2 1939 Feb CEU0500000001 Total Private    25447
 3 1939 Mar CEU0500000001 Total Private    25833
 4 1939 Apr CEU0500000001 Total Private    25801
 5 1939 May CEU0500000001 Total Private    26113
 6 1939 Jun CEU0500000001 Total Private    26485
 7 1939 Jul CEU0500000001 Total Private    26481
 8 1939 Aug CEU0500000001 Total Private    26848
 9 1939 Sep CEU0500000001 Total Private    27468
10 1939 Oct CEU0500000001 Total Private    27830
# ℹ 959 more rows

Exercise 3 - time series graphs

The PBS dataset included in fpp3 contains Monthly Medicare Australia prescription data. Run ?PBS on the console for further info.

We would like to focus on the prescriptions with ATC (Anatomical Therapeutic Chemical Index level 2) equal to H02. This group corresponds to corticosteroids for systemic use.

PBS %>%
  filter(ATC2 == "H02") %>%
  select(ATC2, Month, Cost, everything()) %>%
  arrange(Month)
# A tsibble: 816 x 9 [1M]
# Key:       Concession, Type, ATC1, ATC2 [4]
   ATC2     Month   Cost Concession   Type     ATC1  ATC1_desc ATC2_desc Scripts
   <chr>    <mth>  <dbl> <chr>        <chr>    <chr> <chr>     <chr>       <dbl>
 1 H02   1991 Jul 317384 Concessional Co-paym… H     Systemic… CORTICOS…   63261
 2 H02   1991 Jul  82929 Concessional Safety … H     Systemic… CORTICOS…   12211
 3 H02   1991 Jul  20178 General      Co-paym… H     Systemic… CORTICOS…    4013
 4 H02   1991 Jul   9304 General      Safety … H     Systemic… CORTICOS…    2101
 5 H02   1991 Aug 269891 Concessional Co-paym… H     Systemic… CORTICOS…   53528
 6 H02   1991 Aug 100602 Concessional Safety … H     Systemic… CORTICOS…   14827
 7 H02   1991 Aug  16519 General      Co-paym… H     Systemic… CORTICOS…    3397
 8 H02   1991 Aug  13894 General      Safety … H     Systemic… CORTICOS…    2970
 9 H02   1991 Sep 269703 Concessional Co-paym… H     Systemic… CORTICOS…   52822
10 H02   1991 Sep 128900 Concessional Safety … H     Systemic… CORTICOS…   18896
# ℹ 806 more rows
  1. As you can see, after filtering for ATC2 == H02 there are 4 entries per month, corresponding to different combinations of the columns Concession and Type. We would like to add up the Cost of those 4 entries for every month, to obtain a time series of the total cost per month. Use index_by() and summarise() to get the total cost per month.

  2. Create and interpret the following graphs

  • Time-plot with sufficient resolution to visually identify the seasonality
  • ACF plots and lag plots to identify the seasonality

Exercise 4 - time series graphs

The us_gasoline dataset, loaded with fpp3, contains weekly data beginning in Week 6 of 1991 and ending in Week 3 of 2017. Units are “million barrels per day”. This is the time series you need to analyse:

head(us_gasoline)
# A tsibble: 6 x 2 [1W]
      Week Barrels
    <week>   <dbl>
1 1991 W06    6.62
2 1991 W07    6.43
3 1991 W08    6.58
4 1991 W09    7.22
5 1991 W10    6.88
6 1991 W11    6.95

Create and interpret the following graphs:

  1. Timeplot with sufficient resolution to spot the beginning of each year.
  2. Lagplots and correlogram to examine if there is any seasonality.